February 27, 2019

Towards Intersubjectivity

Research interests

  • Substantive research interests
    • Broader question: Emergence of AfD as party and parliamentary presence - what are the effects on party competition and parliamentarism?
    • Descriptive (preliminary) question: What are the prevalent framings in speeches given by AfD parliamentarians?
    • Contagion hypothesis (diffusion): (Speakers of) other parliamentary groups may take over framings offered by AfD speakers.
  • cf. the DFG project “The populist challenge in parliament” (2019-2021, in cooperation with Christian Strecker, Marcel Lewandowsky and Jochen Müller)
  • Methodological interests
    • Validity and intersubjectivity of data-driven, “distant reading” approaches (in the eHumanities)
    • ML/AI: Annotation to gain training data for statistical learning => gold standard annotation, scores to evaluate trained models
    • Social sciences: Traditions of coding and annotating text data: Quantitative/qualitative content analysis

Focus of the presentation

  • Combining distant and close reading remains an unfulfilled promise: Software often inhibits combining both perspectives. How can workflows for coding and annotating textual data be implemented? The polmineR R package (and a few sibling packages) is presented as a potential solution.

  • Special focus: Interactive graph annotation as an approach to generate intersubjectively shared interpretations/understandings of discourse patterns.

  • Schedule:
    • Theory is code
    • The MigParl corpus
    • AfD Keywords
    • Graph annotation
    • Conclusions

Theory is code

Combining R and CWB

A design for close and distant reading

  • Why R?
    • the most common programming language in the social sciences
    • comprehensive availability of statistical methods
    • great visualisation capabilities
    • usability: RStudio as IDE
    • reproducible research: Rmarkdown notebooks
  • Why the Corpus Workbench (CWB)?
    • a classic toolset for corpus analysis
    • indexing and compression of corpora => performance
    • powerful and versatile syntax of the Corpus Query Processor (CQP)
    • permissive license (GPL)
  • NoSQL / Lucene / Elasticsearch are potential alternatives - but not for now

The PolMine Project R Packages

The “triple”:

  • polmineR: basic vocabulary for corpus analysis

  • RcppCWB: wrapper for the Corpus Workbench (using C++/Rcpp, follow-up on rcqp-package)

  • cwbtools: tools to create and manage CWB indexed corpora

Beyond the triple:

  • GermaParl: documents and disseminates GermaParl corpus
  • frappp: framework for parsing plenary protocols
  • annolite: light-weight fulltext display and annotation tool
  • topicanalysis: workflows with quantitative/qualitative elements for topic models
  • gradget: graph annotation widget

polmineR: Objectives

  • performance: if analysis is slow, interaction with the data suffers

  • portability: painless installation on all major platforms

  • open source: no restrictions and inhibiting licenses

  • usability: make full use of the RStudio IDE

  • documentation: transparency of the methods implemented

  • theory is code: combine quantitative and qualitative methods

Getting started

  • Getting started with polmineR is easy: Assuming that R and RStudio are installed, polmineR can be installed as simply as follows (dependencies such as RcppCWB are installed automatically).
install.packages("polmineR")
  • Get the GermaParl corpus (a corpus of plenary debates in the German Bundestag).
drat::addRepo("polmine") # add CRAN-style repository to known repos
install.packages("GermaParl") # the downloaded package includes a small sample dataset
GermaParl::germaparl_download_corpus() # get the full corpus
  • That’s it. Ready to go.
library(polmineR)
use("GermaParl") # activate the corpora in the GermaParl package, i.e. GERMAPARL

polmineR - the basic vocabulary

  • create subcorpora: partition(), subset()

  • counting: hits(), count(), dispersion() (cf. size())

  • create term-document-matrices: as.TermDocumentMatrix()

  • get keywords / feature extraction: features()

  • compute cooccurrences: cooccurrences(), Cooccurrences()

  • inspect concordances: kwic()

  • recover full text: get_token_stream(), html(), read()
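
The basic vocabulary can be illustrated with a minimal sketch on the GERMAPARL corpus (assuming polmineR and the GermaParl data have been installed as shown above; the query term "Integration" is purely illustrative):

```r
library(polmineR)
use("GermaParl")  # activate the GERMAPARL corpus

# total hits for a term in the full corpus
count("GERMAPARL", query = "Integration")

# hits broken down by year
dispersion("GERMAPARL", query = "Integration", s_attribute = "year")

# concordances (keyword in context)
kwic("GERMAPARL", query = "Integration", left = 5, right = 5)
```

Each call returns a data structure that can be processed further with standard R tooling, which is the point of wrapping CWB in an R API.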

Metadata and partitions/subcorpora

  • the good old workflow to create partitions (i.e. subcorpora)
p <- partition("GERMAPARL", year = 2001)
m <- partition("GERMAPARL", speaker = "Merkel", regex = TRUE)
  • the emerging new workflow …
am <- corpus("GERMAPARL") %>% subset(speaker == "Angela Merkel")

m <- corpus("GERMAPARL") %>% subset(grep("Merkel", speaker)) # caution: also matches Petra Merkel!

cdu_csu <- corpus("GERMAPARL") %>%
  subset(party %in% c("CDU", "CSU")) %>%
  subset(role != "presidency")

Counting and dispersions

dt <- dispersion("GERMAPARL", query = "Flüchtlinge", s_attribute = "year")
barplot(height = dt$count, names.arg = dt$year, las = 2, ylab = "Frequency")

Concordances / KWIC output

q <- '[pos = "NN"] "mit" "Migrationshintergrund"'
corpus("GERMAPARL") %>% kwic(query = q, cqp = TRUE, left = 10, right = 10)

Validating sentiment analysis

# good / bad: character vectors of positive / negative terms (from SentiWS)
kwic("GERMAPARL", query = "Islam", positivelist = c(good, bad)) %>%
  highlight(lightgreen = good, orange = bad) %>%
  tooltips(setNames(SentiWS[["word"]], SentiWS[["weight"]])) %>%
  knit_print()

Full text output

partition("GERMAPARL", date = "2009-11-10", speaker = "Merkel", regex = TRUE) %>%
  html(height = "250px") %>%
  highlight(list(yellow = c("Bundestag", "Regierung")))

Most likely terms for topics

p <- partition("BE", date = "2005-04-28", speaker = "Körting", regex = TRUE, verbose = FALSE)
sc <- corpus("BE") %>%  subset(date == "2005-04-28") %>% subset(grepl("Körting", speaker))
ek <- as.speeches(p, s_attribute_name = "speaker", verbose = FALSE)[[4]]
h <- get_highlight_list(BE_lda, partition_obj = ek, no_token = 150)
lapply(h, function(x) x[1:8])

Validating topic models

html(p, height = "350px") %>% highlight(highlight = h)

Data

The MigParl Corpus

  • Prepared in the MigTex Project (“Textressourcen für die Migrations- und Integrationsforschung”, funding: BMFSFJ)

  • Preparation of all plenary debates in Germany’s regional parliaments (2000-2018) using the “Framework for Parsing Plenary Protocols” (frappp-package)

  • Extraction of a thematic subcorpus using unsupervised learning (topic modelling)

  • Size of the MigParl corpus: 27,241,205 tokens

  • Size without interjections and presidency: 22,837,376 tokens

  • structural annotation: id | speaker | party | role | lp | session | date | regional_state | interjection | year | agenda_item | agenda_item_type | speech | topics | harmonized_topics
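
The structural annotation makes it possible to slice MigParl along these attributes with the same subset() idiom shown for GermaParl. A sketch, assuming the corpus is registered under the ID "MIGPARL" and that the attribute values used here ("AfD", "FALSE", "SN") match the corpus encoding:

```r
library(polmineR)
use("MigParl")  # assumption: package name as distributed by the PolMine Project

# AfD speeches, without interjections, in one regional parliament
afd <- corpus("MIGPARL") %>%
  subset(party == "AfD") %>%
  subset(interjection == "FALSE") %>%   # assumption: logicals encoded as strings
  subset(regional_state == "SN")        # hypothetical value: Saxony
```

The resulting subcorpus can then feed the counting, keyword and cooccurrence methods introduced above.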

MigParl by year

AfD in MigParl - tokens

AfD in MigParl - share

MigParl - regional dispersion

AfD Keywords

Term extraction I

Term extraction II (ADJA - NN)

Term extraction III (NN-ART-NN)

Graph Annotation

Ego-Networks

Leipzig Corpus Miner (LCM)

“Wuchern der Rhizome” (“proliferation of the rhizomes”)

(Joachim Scharloth)


polmineR & cooccurrences

  • The cooccurrences() method can be applied to corpora as well as to subcorpora/partitions.
cooccurrences("GERMAPARL", query = 'Islam', left = 10, right = 10)

Getting all cooccurrences

Starting with polmineR v0.7.9.11, the package includes a method to efficiently calculate all cooccurrences in a corpus.

m <- partition("GERMAPARL", year = 2008, speaker = "Angela Merkel", interjection = FALSE)
drop <- terms(m, p_attribute = "word") %>% noise() %>% unlist()
Cooccurrences(m, p_attribute = "word", left = 5L, right = 5L, stoplist = drop) %>% 
  decode() %>%
  ll() %>%
  subset(ll >= 10.83) %>%
  subset(ab_count >= 5) -> coocs
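
The filtered cooccurrences can then be turned into a graph object for the visualisations discussed below. A sketch using igraph, assuming `coocs` from the pipeline above can be coerced to a data.frame with the node-pair columns `a_word` / `b_word` and the `ll` score (column names are an assumption):

```r
library(igraph)
library(magrittr)

# assumption: first two columns are the node pair, ll is the test statistic
edges <- as.data.frame(coocs)[, c("a_word", "b_word", "ll")]
g <- graph_from_data_frame(edges, directed = FALSE)

# keep only the N strongest edges to tame visual complexity
n <- 100
g_small <- subgraph.edges(g, order(E(g)$ll, decreasing = TRUE)[seq_len(n)])
plot(g_small, vertex.size = 3, vertex.label.cex = 0.7)
```

Varying `n` reproduces the filter-dependence problem discussed in the next slides: the apparent structure of the graph changes with the cut-off.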

AfD Cooccurrences

Graph visualisation (2D, N = 100)

Graph visualisation (2D, N = 250)

Graph visualisation (2D, N = 400)

Where we stand

  • Dependence of graph layout on filter decisions

  • Difficulties to justify filter decisions

  • Possibilities to provide extra information, but perils of information overload

  • Overwhelming complexity of large graphs

  • How to handle the complexity and create the foundations for intersubjectivity?

Graph visualisation (3D)

Conclusions

Conclusions

  • Politeness of the AfD

  • No self-isolation? Interaction with other parties (and visitors!)

  • Cultivating antagonism: “We” (AfD / AfD parliamentary group) versus the others

  • It’s the economy: Introducing a redistributive logic as a leitmotif

  • “Visual hermeneutics” (G. Schaal): Graph annotation as an approach to realise the ideal of combining distant and close reading